Data Visualisation 1

Ben Whalley

October 2022

Inspiration

In the session we watched Hans Rosling’s “200 countries and 200 years in 4 minutes”, which we (hopefully) agreed is something to aspire to. Combined with his enthusiastic presentation, the visualisations in this clip support a clear narrative and help us understand a complex dataset.

The plot he builds is interesting because it uses many different visual attributes (aesthetics) to express features in the data, including:

  • X and Y axes
  • Colour
  • Size of the points
  • Time (in the animation)

These features are carefully selected to highlight important features of the data and support the narrative he provides. Although we need to have integrity in our plotting (we discuss bad examples in the session), this narrative aspect of a plot is important: we always need to consider our audience.

Before you start

Use the ‘files pane’ in RStudio to make a new folder on the RStudio server to save your work. Call this datafluency2022.

Inside this new folder, make a new RMarkdown file (use the ‘file’ menu and choose ‘new’). When you save the file make sure it has the extension .rmd, so call it datavis.rmd for example.

Use this new .rmd file to save your work during this session.

Recreate the Rosling plot

“Multi-dimensional plotting” sounds fancy, but it just means linking different visual features of a plot to columns in the dataset.

In the example, Rosling’s plot is appealing and informative because it adds multiple dimensions, and uses a special logarithmic scale for the x-axis.

  • The Rosling plot shown above has dimensions

  • The additional dimension in the plot shown in the BBC video is:

  • x axis (GDP)
  • y axis (life exp.)
  • color (continent)
  • size (population)

Time is the extra dimension, because it shows these other values changing across years.

Defining dimensions/aesthetics

As a reminder: ggplot uses the term aesthetics to refer to different dimensions of a plot.

Aesthetics’ refers to ‘what things look like’, and the aes() command in ggplot maps variables (columns in the dataset) to visual features of the plot.

There are 4 visual features (aesthetics) of plots we will use in this session:

  • x and y axes
  • colour
  • size (of a point, or thickness of a line)

We could also use:

  • shape (of points)
  • linetype (of added lines, i.e. dotted/patterned or solid)



Additionally, we will control the scale of the axes in the plot to improve the presentation of the data.

Rosling’s plot looked something like this:

To create a (slightly simplified) version of the plot above, the code would look something like this:

development %>%
  filter(BLANK==BLANK)  %>%
  ggplot(aes(x=BLANK,
             y=BLANK,
             size=BLANK,
             color=BLANK)) +
  geom_point() + 
  scale_x_log10() + 
  labs(x=BLANK, y=BLANK, color=BLANK, size=BLANK)

I have removed some parts of the code. Your job is to edit the parts which say <BLANK> and replace them with the names of variables from the development dataset (available in the psydata package).

Some hints:

  • All the BLANKs represent variable names in the dataset. You can see a list of the column names available by typing glimpse(development)
  • If you are confused by the filter(BLANK==BLANK) check the title of the plot above. Remember that filter selects particular rows from the data, so we can use it to restrict what is shown in the plot. What data do we need to select for this plot?
  • The part which reads scale_log_10() is explained in more detail below
  • The part which reads labs(...) is explained below.

Video hints:

Using multiple layers

When visualizing data, there’s always more than one way to do things. As well as plotting different dimensions, different types of plot can highlight different features of the data. In ggplot, different types of plots are called geometries. Multiple layers can be combined in the same plot by adding together commands which have the prefix “geom_”.

As we have already seen, geom_point() is used to create a scatter plot:

To add additional layers to this plot, we can add extra geom_<NAME> functions. For example, geom_smooth is used to overlay a smooth line to any x/y plot:

fuel %>% 
  ggplot(aes(weight, mpg)) + 
  geom_point() +
  geom_smooth()

Explanation of the command: We added + geom_smooth() to our previous plot. This means we now have two geometries added to the same plot: geom_point and geom_smooth.

Explanation of the output: The plot shown is the same as the previous scatterplot, but now has a smooth blue line overlaid. This represents the local-average of mpg, for each level of weight. There is also a grey-shaded area, which represents the standard error of the local average (again there will be more on this later in the course).


We could add other layers to the plot with other geom_ functions.

For example, we could calculate the average mpg and weight of all cars in the dataset and overlay that as a horizontal or vertical line:

med_mpg <- fuel %>% summarise(median(mpg)) %>% pull(1)

fuel %>% 
  ggplot(aes(weight, mpg)) + 
  geom_point() +
  geom_smooth() + 
  geom_hline(yintercept = med_mpg, color="red") 

Explanation of the code First we calculate the median of mpg using summarise. Then we use the pull(1) command for the first time. This selects the first column of data (containing our median) and returns only that. That is, it returns only a number, or a list of numbers, rather than the whole dataframe. Then We added geom_hline and geom_vline functions to our existing plot. We set the yintercept option to the stored med_mpg value, and this defines the height of the line. We added color="red" to make these added lines distinctive.

Explanation of the output The plot is the same as before, but now has a red line marked at the mean of the mpg column. The red line is on top of the other plot elements because we added this line at the end of our plotting code.

Make your own layered plot

  1. Use the mentalh data-set from psydata. Create a scatter plot of screen time and anxiety scores, adapting the code above.

  2. Add a smoothed line to the plot using geom_smooth()

  3. Colour the plot, using the education variable.

  4. Add a horizontal line to the plot, with the yintercept set to the average anxiety score.

med_anx <- mentalh %>% summarise(median(anxiety_score)) %>% pull(1)

mentalh %>% 
  ggplot(aes(screen_time, anxiety_score, color=education)) + 
  geom_point() + 
  geom_smooth(se=F) + 
  geom_hline(yintercept = med_anx)

Scales

Incomes are “not normal”

If we re-plot the development data using the default settings in ggplot you might notice that the result looks quite different to the one shown in the video or the exercise above.

In particular, the placement of the points looks quite different:

Specifically, we can see that in the left hand panel the points are mostly compressed to the left hand of the frame. In contrast, in the original plot, the points are fairly evenly spread across the x-axis.

These plots are showing the same data. The only difference is that the original plot uses a log scale.


We can recognise the log scale by looking at the markings on the x axis:

  • In the left hand panel the markings go up by 30,000 each time.
  • In the original plot, each marker represents a value 10 times larger than the previous. So, 1000, 10,0000 and 100,000 (the values shown are in scientific notation, so 1e+03 means 1000).

Skewed distributions

Another way to see why this helps is to plot the distribution of incomes:

development %>% 
  ggplot(aes(gdp_per_capita)) + 
  geom_histogram()

Explanation of the code We used the geom_histogram function to make histogram of the GDP per capita variable.

Explanation of the output The histogram shows that most GDP values are below $20,000, but a small number are much, much larger (i.e. > $100,000). This is quite typical of incomes data.

We can then add scale_x_log10() to the same plot:

development %>% 
  ggplot(aes(gdp_per_capita)) + 
  geom_histogram() + 
  scale_x_log10()

Explanation of the code We make another, histogram this time adding scale_x_log10().

Explanation of the output The plot changes, and the distribution is less skewed. We can see that the scale markers are again in scientific notation, and the gaps between points of the scale are not equal: each point on the scale is 10 times larger than the previous one (1000, 10,000, 100,0000), stretching out the values across the x axis and reducing the skew.

Reaction times

In the example above we saw that incomes were not normally distributed and benefited from a log scale. Another common example of ‘non-normal’ data are those from reaction time studies.

For example:

rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')

rtdata %>% 
  ggplot(aes(rt)) + 
  geom_histogram()

Explanation of the code The line with read_csv takes a web address (URL) and reads a ‘comma separated values’ data file from it. The next part makes a histogram using the reaction time (RT) data it contains. In case it is truncated in the output above, the full url is: https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv

Explanation of the output The RT data are strongly skewed. Consequently the median, 2.01, is lower than the mean value, 2.51.

  1. Copy and paste the line of code which reads the CSV data and run it.
  2. Recreate the histogram above (use geom_histogram) or a density plot (use geom_density)
  3. Add the correct scale function to recreate this plot:

rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')

rtdata %>% 
  ggplot(aes(rt)) + 
  geom_histogram() + 
  scale_x_log10()

Or

rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')

rtdata %>% 
  ggplot(aes(rt)) + 
  geom_density() + 
  scale_x_log10()

`

Animation!

This is an optional exercise and is not required for the course assessment. Skip to the final section if you are short on time.


The gganimate package allows us to create animations using ggplot. The package has good documentation here: https://gganimate.com.

As an example we can load the package:

library(gganimate)

And then adapt our previous ggplot by adding transition_time(year). This adds year as a time-based dimension, animating the plot.

We need to save the resulting plot in a variable, and then send that to the animate function. Here we use a variable called progress_plot.

progress_plot <- development %>%
  ggplot(aes(gdp_per_capita, life_expectancy, color=continent)) +
  geom_point() +
  scale_x_log10() +  
  transition_time(year)

progress_plot %>% animate()

Try to animate the following plot using data from all the years in the development dataset. To do this, amend the code below, referring to the example above:

development %>%
  filter(year == 1952) %>% 
  ggplot(aes(continent, life_expectancy)) +
  geom_boxplot() +
  labs(title="Year: 1952")

You need to:

  • Remove the filter
  • Use the transition_time() function

library(tidyverse)
library(psydata)
library(gganimate)

p <- development %>%
  ggplot(aes(continent, life_expectancy)) +
  geom_boxplot() + 
  labs(title = "Year: {frame_time}") + 
  transition_time(year)

p %>% animate()

Graphics are for answering questions

Note: you do not need to have completed the extension exercise using animation to engage with this activity.

Consider the following plots, which include the animation we made above:

Animated boxplot

Boxplot

Line graph

This plot shows the median life expectancy in 1952 and 2002.

Ribbon plot

This plot shows the median (line) and IQR (shaded area) in all available years:

development %>%
  ggplot(aes(year, life_expectancy, fill=continent, color=continent)) +
  stat_summary(geom="ribbon", fun.data = median_iqr, alpha=.2, linetype=0) + 
  stat_summary(geom="line", fun.data=median_iqr) + 
  theme_minimal() +
  labs(title = "Life expectancy between 1952 and 2022 (median and IQR)", x="Year", 
       y="Life expectancy (years)", 
       fill="Continent", color="Continent") 

  1. Create a table of the strengths and weaknesses of each plot, like so:
Plot Strengths Weaknesses
Animated boxplot
Boxplot
Line graph
  1. Think of some questions that can be answered from these data? For example:
  • Which continent changed most between 1952 and 2002?
  • Which continent has the most/least variability in life expectancy?
  • Which continents became more/less homogeneous over this period?

And so on… Make a list of at least 5 or 6 questions which someone might want to know the answer to.

  1. Which plots are most effective in answering each of your questions?

  2. Imagine you were a journalist writing a story with the title “Asia sees fastest rise in life expectancy since WW2”. Can you think of any reasons to prefer the line graph over the boxplot?

Check your knowledge

  1. What does .Rmd stand for (what type of file is it)?
  2. Give three examples of visual ‘aesthetics’ in which can be adjusted in ggplot
  3. Which is correct: labs(x=VARIABLE_NAME) or labs(x="Variable name"). Why is this?
  4. What symbol do we use to add a geom_point() layer to a plot? Is it %>% or +?
  5. What value do we need to set for a geom_hline() layer when adding it to a plot?
  6. Explain an advantage of using a log-scale plot for income data? Give an example of another type of data, common in psychology, where a log scale might also be useful.
  7. What sort of plots would allow us to check if a distribution is skewed?